Writing a great story for data science projects - Summer 2025

This is a Report Template Quarto

Author

Colby Fenters & Lilith Holland (Advisor: Dr. Cohen)

Published

July 16, 2025

Slides: slides.html ( Go to slides.qmd to edit)

Important

Remember: Your goal is to make your audience understand and care about your findings. By crafting a compelling story, you can effectively communicate the value of your data science project.

Carefully read this template since it has instructions and tips to writing!

Introduction

Currently a work in progress

For this project we hope to compare the predictive power of a traditional linear model against a graph neural network on weather station data.

The weather station data is comprised of weather station nodes that are within the same geographic location. For the linear model two variants will be produced, one that is trained on all data treating each station as a node and another trained on each station individually.

The GNN will be trained on the entire graph of the weather station data using a dense adjacency matrix with no self connection.

The training will focus on producing forecasts with the performance being compared over a 6 month period on the tail end of the data.

Methods

This section will be expanded as the cleaning process is further refined

  • Detail the models or algorithms used.

  • Justify your choices based on the problem and data.

The common non-parametric regression model is \(Y_i = m(X_i) + \varepsilon_i\), where \(Y_i\) can be defined as the sum of the regression function value \(m(x)\) for \(X_i\). Here \(m(x)\) is unknown and \(\varepsilon_i\) some errors. With the help of this definition, we can create the estimation for local averaging i.e. \(m(x)\) can be estimated with the product of \(Y_i\) average and \(X_i\) is near to \(x\). In other words, this means that we are discovering the line through the data points with the help of surrounding data points. The estimation formula is printed below (R-base?):

\[ M_n(x) = \sum_{i=1}^{n} W_n (X_i) Y_i \tag{1} \]\(W_n(x)\) is the sum of weights that belongs to all real numbers. Weights are positive numbers and small if \(X_i\) is far from \(x\).

Another equation:

\[ y_i = \beta_0 + \beta_1 X_1 +\varepsilon_i \]

Analysis and Results

Data Exploration and Visualization

  • Describe your data sources and collection process.

  • Present initial findings and insights through visualizations.

  • Highlight unexpected patterns or anomalies.

A study was conducted to determine how…

Code
reticulate::repl_python()
exit
Code
# %pip install polars==1.22.0
# %pip install numpy
# %pip install pandas
# %pip install seaborn
# %pip install matplotlib
# %pip install pyarrow
# %pip install datetime
# %pip install contextily
# %pip install h3==3.7.7
# %pip install shapely
# %pip install geopandas
# %pip install scikit-learn
# %pip install tqdm
# %pip install hypothesis==6.135.26
Code
# Standard library imports (e.g., OS, datetime, etc.)
import os
import typing
import datetime
from pathlib import Path
import shutil

# Third-party libraries
import numpy as np
import polars as pl
import seaborn as sns
import pandas as pd
from polars.testing.parametric import columns
from sklearn.preprocessing import minmax_scale, robust_scale
from tqdm import tqdm
import geopy.distance
import scipy
import sklearn.preprocessing as pre
import geopandas
import shapely
from shapely.geometry import Polygon, Point, LineString
import matplotlib.pyplot as plt
import matplotlib.animation as animation

import contextily as ctx
from h3 import h3
Code
import datetime
import polars as pl

start_date = datetime.datetime(2010, 1, 1, 0, 0)
end_date = datetime.datetime(2020, 12, 31, 0, 0)

seed = 3435

pl.enable_string_cache()

data_path = r'kansas_asos_2010_2020.csv'
Code
metar_schema = {'station': pl.Categorical,
                'valid': pl.Datetime,
                'lon': pl.Float64,
                'lat': pl.Float64,
                'elevation': pl.Float64,
                'tmpf': pl.Float64,
                'dwpf': pl.Float64,
                'relh': pl.Float64,
                'drct': pl.Float64,
                'sknt': pl.Float64,
                'p01i': pl.Float64,
                'alti': pl.Float64,
                'mslp': pl.Float64,
                'vsby': pl.Float64,
                'gust': pl.Float64,
                'skyc1': pl.Categorical,
                'skyc2': pl.Categorical,
                'skyc3': pl.Categorical,
                'skyc4': pl.Categorical,
                'skyl1': pl.Float64,
                'skyl2': pl.Float64,
                'skyl3': pl.Float64,
                'skyl4': pl.Float64,
                'wxcodes': pl.String,
                'ice_accretion_1hr': pl.Float64,
                'ice_accretion_3hr': pl.Float64,
                'ice_accretion_6hr': pl.Float64,
                'peak_wind_gust': pl.Float64,
                'peak_wind_drct': pl.Float64,
                'peak_wind_time': pl.Datetime,
                'feel': pl.Float64,
                'metar': pl.String,
                'snowdepth': pl.Float64}
Code
asos_ldf = pl.scan_csv(data_path, null_values=['T', 'M', '///'], schema=metar_schema)\
    .drop('metar')\
    .with_columns(pl.col('valid').dt.round('1h').alias('valid'))
Code
import numpy as np

full_date_series = np.arange(start_date, end_date, datetime.timedelta(hours=1))

asos_df = asos_ldf\
    .collect()\
    .select(pl.col('station', 'lat', 'lon', 'elevation'))\
    .unique()\
    .join(pl.DataFrame({'valid': full_date_series}), how='cross')\
    .join(asos_ldf.collect(), on=['station', 'valid'], how='left')\
    .with_columns(pl.col('valid').dt.round('6h').alias('valid'))\
    .drop('lat_right', 'lon_right', 'elevation_right')\
    .group_by(['station', 'valid'])\
    .mean()\
    .with_columns(pl.col(pl.Float64).cast(pl.Float32))
Code
potential_features = asos_ldf.drop('valid', 'station', 'lat', 'lon', 'elevation').collect_schema().names()
feature_list = []

for feature in potential_features:
    if not asos_df.select(pl.col(feature).is_null().all()).item():
        feature_list.append(feature)

stations_list = asos_df\
    .select(pl.col('station'))\
    .unique()\
    .to_series()\
    .to_list()
Code
from tqdm import tqdm

def safe_index(item, lst):
    return item in lst
  
year_series = np.arange(start_date.year, end_date.year + 1, 1)
reduced_feature_df = pl.DataFrame(schema={**{'start_year': pl.Int64, 'end_year': pl.Int64, 'year_range': pl.Int64, 'year_label': pl.String, 'feature': pl.String},
                                          **{station: pl.Boolean for station in stations_list}
                                          })

for index, a in enumerate(tqdm(year_series)):
    for b in year_series[index:]:
        shifted_date_series = np.arange(datetime.date(a, 1, 1), datetime.date(b, 12, 31), datetime.timedelta(hours=6))
        year_filter_df = asos_df.filter(pl.col('valid').dt.year().is_between(a, b))
        for feature in feature_list:
            if (year_filter_df.select(pl.col(feature).null_count()) != year_filter_df.height).item():
                asos_pivot_df = year_filter_df\
                    .pivot(on='station', index='valid', values=feature, aggregate_function='mean')\
                    .drop('valid')
                valid_stations = [s.name for s in asos_pivot_df if not (s.null_count() > len(shifted_date_series)*0.1)]
                if len(valid_stations) >= 6:
                    if asos_pivot_df.var().select(pl.mean_horizontal(pl.all()).alias('mean')).item() > 5:
                        valid_stations.sort()
                        new_row = pl.DataFrame({**{'start_year': a,
                                                   'end_year': b,
                                                   'year_range': (b+1) - a,
                                                   'year_label': f'{a}-{b}',
                                                   'feature': feature},
                                                **{station: safe_index(station, valid_stations) for station in stations_list}
                                                })
                        reduced_feature_df = reduced_feature_df.vstack(new_row)

  0%|          | 0/11 [00:00<?, ?it/s]
  9%|9         | 1/11 [00:01<00:10,  1.06s/it]
 18%|#8        | 2/11 [00:01<00:08,  1.04it/s]
 27%|##7       | 3/11 [00:02<00:06,  1.16it/s]
 36%|###6      | 4/11 [00:03<00:05,  1.31it/s]
 45%|####5     | 5/11 [00:03<00:04,  1.50it/s]
 55%|#####4    | 6/11 [00:04<00:02,  1.75it/s]
 64%|######3   | 7/11 [00:04<00:01,  2.07it/s]
 73%|#######2  | 8/11 [00:04<00:01,  2.51it/s]
 82%|########1 | 9/11 [00:04<00:00,  3.12it/s]
100%|##########| 11/11 [00:04<00:00,  5.01it/s]
100%|##########| 11/11 [00:04<00:00,  2.21it/s]
Code
import matplotlib.pyplot as plt
import seaborn as sns

features = reduced_feature_df.select(pl.col('feature')).unique().to_series().to_list()

n_cols = 3
n_rows = -(-len(features) // n_cols)  # Ceiling division

fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols * 4, n_rows * 3), constrained_layout=True,
                         sharex=True, sharey=True)
axes = axes.flatten()

for idx, feature in enumerate(features):
    plot_df = (
        reduced_feature_df
        .filter(pl.col('feature') == feature)
        .drop('feature', 'start_year', 'end_year', 'year_range')
        .to_pandas()
    )

    plot_df.index = plot_df['year_label'].astype(str)  # Ensure index is str, not int
    plot_df = plot_df.drop('year_label', axis=1)
    plot_df = plot_df.sort_index()

    ax = axes[idx]
    sns.heatmap(plot_df, cmap='magma', annot=True, cbar=False, ax=ax)

    ax.set_title(feature)

# Shared axis labels
fig.suptitle("Station Participation by Feature and Year", fontsize=16)
fig.supxlabel("Stations")
fig.supylabel("Year")

plt.show()

Code
df_t = reduced_feature_df\
    .filter(pl.col('year_label').eq('2018-2020'))\
    .drop('year_range', 'year_label', 'start_year', 'end_year')\
    .transpose(include_header=True)


new_headers = df_t.row(0)

# Drop the first row
df_t_no_header = df_t.slice(1, df_t.height - 1)

# Assign new headers
df_t_renamed = df_t_no_header.rename({old: str(new) for old, new in zip(df_t_no_header.columns, new_headers)})
Code
valid_features = [col for col in df_t_renamed.columns if col != "feature"]
valid_stations = df_t_renamed\
    .with_columns(pl.col(valid_features)\
                  .map_elements(lambda x: x == 'true', return_dtype=pl.Boolean))\
    .filter( pl.all_horizontal([pl.col(col) for col in valid_features]))\
    .select(pl.col('feature'))\
    .to_series()\
    .to_list()
Code
reduced_asos_df = asos_df\
    .filter(pl.col('valid').is_between(datetime.datetime(2018, 1, 1, 0, 0), datetime.datetime(2021, 1, 1, 0, 0)))\
    .filter(pl.col('station').is_in(valid_stations))\
    .select(pl.col(['station', 'valid', 'lat', 'lon', 'elevation'] + valid_features))\
    .sort(['valid', 'station'])\
    .with_columns([(pl.col('drct').map_elements(lambda x: np.sin(np.radians(x)), return_dtype=pl.Float64)).alias('drct_sin'),
                   (pl.col('drct').map_elements(lambda x: np.cos(np.radians(x)), return_dtype=pl.Float64)).alias('drct_cos')])\
    .drop('drct')

row_count, feature_count = reduced_asos_df.drop('station', 'valid', 'lat', 'lon', 'elevation').shape
valid_station = reduced_asos_df.select(pl.col('station')).head(7).to_series().to_list()
station_count = len(valid_station)
valid_features = reduced_asos_df.drop('station', 'valid', 'lat', 'lon', 'elevation').columns

# shape is time, station, feature
station_matrix = reduced_asos_df.drop('station', 'valid', 'lat', 'lon', 'elevation').to_numpy().reshape(int(row_count/station_count), station_count, feature_count)
Code
reduced_asos_df
shape: (30_667, 12)
station valid lat lon elevation tmpf dwpf relh sknt feel drct_sin drct_cos
cat datetime[μs] f32 f32 f32 f32 f32 f32 f32 f32 f64 f64
"GCK" 2018-01-01 00:00:00 37.927502 -100.724403 881.0 9.283334 -6.5 48.148335 8.666667 -4.161667 0.851117 0.524977
"LBL" 2018-01-01 00:00:00 37.044201 -100.9599 879.0 12.316667 -1.983333 52.014999 10.166667 -1.721667 0.664796 0.747025
"EHA" 2018-01-01 00:00:00 37.000801 -101.879997 1099.0 15.555555 5.255556 63.382778 7.388889 4.482222 0.970763 0.24004
"HQG" 2018-01-01 00:00:00 37.163101 -101.370499 956.52002 14.311111 -1.605556 48.468334 7.777778 2.681667 0.936332 0.351115
"3K3" 2018-01-01 00:00:00 37.991699 -101.7463 1005.700012 13.1 -0.9 53.127777 6.777778 2.151111 0.981255 0.192712
"EHA" 2020-12-31 00:00:00 37.000801 -101.879997 1099.0 42.355556 16.588888 34.955555 3.0 40.254444 -0.533205 -0.845986
"HQG" 2020-12-31 00:00:00 37.163101 -101.370499 956.52002 40.722221 13.255555 32.23111 1.666667 39.450001 0.977334 -0.211704
"3K3" 2020-12-31 00:00:00 37.991699 -101.7463 1005.700012 40.200001 12.2 31.377777 4.555555 36.683334 -0.700217 -0.71393
"JHN" 2020-12-31 00:00:00 37.578201 -101.7304 1012.710022 40.711113 17.4 38.554443 4.777778 37.06889 -0.824675 -0.565607
"19S" 2020-12-31 00:00:00 37.496899 -100.832901 892.570007 39.922222 15.388889 36.552223 6.111111 35.113335 -0.824675 0.565607
Code
# as this is true it means there are no time slices where all values are nan
not np.any(np.all(np.isnan(station_matrix), axis=(1, 2)))
True
Code
import geopy.distance
# calculate the geodesic distance between two nodes
def compute_node_distance(node1, node2, inverse=False):
    coords_1 = (node1[1], node1[0])
    coords_2 = (node2[1], node2[0])
    # elevation_difference = abs(node1[2] - node2[2])
    horizontal_distance = geopy.distance.geodesic(coords_1, coords_2).km
    # return math.sqrt(horizontal_distance**2 + (elevation_difference/1000)**2)
    if inverse:
        try:
            horizontal_distance = 1/horizontal_distance
        except ZeroDivisionError:
            horizontal_distance = 0
    return horizontal_distance
Code
station_df = reduced_asos_df.select(pl.col(['station', 'lon', 'lat', 'elevation'])).unique().to_pandas()
grid_list = station_df.loc[:, ['lon', 'lat']].reset_index()[['lon', 'lat', 'index']].to_numpy().tolist()
grid_list = [sublist[:-1] + [int(sublist[-1])] for sublist in grid_list]


result_dict = {'index': [],
               'station': []}
result_dict = {**result_dict, **{str(i): [] for _, _, i in grid_list}}

# find the nearest mesh node for every station node
for row_index, station in station_df.iterrows():
    for col_index, station2 in station_df.iterrows():
        # To get elevation with the new system I would have to find a heightmap for the new lon/lat coordinates
        result_dict[str(col_index)].append(compute_node_distance([station['lon'], station['lat']], [station2['lon'], station2['lat']], inverse=True))
    result_dict['station'].append(station['station'])
    result_dict['index'].append(row_index)

# # generate a dataframe that maps every station node to its relative mesh node with the inverse distance
grid_map_df = pl.DataFrame(result_dict, schema={**{'station': pl.Categorical, 'index': pl.UInt64}, **{str(i): pl.Float64 for _, _, i in grid_list}})
Code
from sklearn.preprocessing import MinMaxScaler, minmax_scale

scaled_idistance = minmax_scale(grid_map_df.drop('station', 'index'))
modified_adjacency_df = grid_map_df.select(pl.col(['station', 'index']))\
    .join(pl.DataFrame(scaled_idistance, schema=[str(i) for i, _ in enumerate(scaled_idistance)]).with_row_index(),
          on='index')
Code
from shapely.geometry import Polygon, Point, LineString
import geopandas
import contextily as ctx
# Create Point geometries for each station (lon, lat order)
station_node_list = [Point(lon, lat) for lat, lon in station_df[['lat', 'lon']].to_numpy()]
stations_gdf = geopandas.GeoDataFrame(station_df.copy(), geometry=station_node_list, crs="EPSG:4326")

# Create edges as LineStrings between all unique pairs of stations
edge_list = []
n = len(station_node_list)
for i in range(n):
    for j in range(i + 1, n):
        edge_list.append(LineString([station_node_list[i], station_node_list[j]]))

edges_gdf = geopandas.GeoDataFrame(geometry=edge_list, crs="EPSG:4326")

# Plot everything
fig, ax = plt.subplots(figsize=(18, 10))

# Plot edges
edges_gdf.plot(ax=ax, color="xkcd:blue", linewidth=1, alpha=1)

# Plot nodes
stations_gdf.plot(ax=ax, color="xkcd:bright orange", markersize=10)

# Add station labels
for i, row in stations_gdf.iterrows():
    ax.text(row.geometry.x, row.geometry.y, row['station'], fontsize=9)

# Add basemap
ctx.add_basemap(ax, crs=stations_gdf.crs.to_string())

Code
adj = modified_adjacency_df.drop('station', 'index').to_numpy()
adj = adj / adj.sum(axis=1, keepdims=True)  # row normalize

def spatial_impute(data, adj):
    imputed = data.copy()
    T, N, F = data.shape

    for t in range(T):  # time index
        for i in range(N):  # station index
            for f in range(F):  # feature index
                if np.isnan(imputed[t, i, f]):
                    # Get neighbor values and weights
                    neighbor_vals = imputed[t, :, f]
                    weights = adj[i]
                    mask = ~np.isnan(neighbor_vals)

                    if mask.sum() > 0:
                        imputed[t, i, f] = np.dot(weights[mask], neighbor_vals[mask]) / weights[mask].sum()

    return imputed

def spatiotemporal_impute(data, adj):
    data = spatial_impute(data, adj)

    # fill remaining NaNs using temporal interpolation
    T, N, F = data.shape
    for i in tqdm(range(N)):
        for f in range(F):
            series = data[:, i, f]
            mask = ~np.isnan(series)
            if mask.sum() == 0:
                continue  # entire feature missing, can't interpolate
            # Linear interpolation
            indices = np.arange(T)
            data[:, i, f] = np.interp(indices, indices[mask], series[mask])

    return data

data_imputed = spatiotemporal_impute(station_matrix, adj)

  0%|          | 0/7 [00:00<?, ?it/s]
100%|##########| 7/7 [00:00<00:00, 4667.75it/s]
Code
import pandas as pd
pd.DataFrame(data_imputed[:200, :, 2]).plot(legend=False, subplots=True, figsize=(20, 8))
array([<Axes: >, <Axes: >, <Axes: >, <Axes: >, <Axes: >, <Axes: >,
       <Axes: >], dtype=object)
Code
pd.DataFrame(station_matrix[:200, :, 2]).plot(legend=False, subplots=True, figsize=(20, 8))
array([<Axes: >, <Axes: >, <Axes: >, <Axes: >, <Axes: >, <Axes: >,
       <Axes: >], dtype=object)

Code
def node_correlation(data, node_number):
    T, N, F = data.shape

    # Store correlation matrices per node (features x features)
    corr_within_node = np.empty((N, F, F))

    for node in range(N):
        # Slice data for node: shape (T, F)
        node_data = data[:, node, :]
        # Compute correlation matrix (features x features)
        corr_within_node[node] = np.corrcoef(node_data, rowvar=False)

    # Example: correlation between feature 0 and feature 1 at node 0
    corr_list = []
    for f_feature in range(F):
        inner_list = []
        for s_feature in range(F):
            if f_feature < s_feature:
                inner_list.append(np.nan)
            else:
                inner_list.append(float(corr_within_node[node_number, f_feature, s_feature]))
        corr_list.append(inner_list)
    return corr_list
  
def between_node_correlation(data, feature_number):

    feature_data = data[:, :, feature_number]

    corr_between_nodes = np.corrcoef(feature_data.T)
    corr_between_nodes[np.triu_indices(corr_between_nodes.shape[0], 1)] = np.nan

    return corr_between_nodes

Code
fig, axes = plt.subplots(2, 4, figsize=(20, 10))
axes = axes.flatten()

for i in range(data_imputed.shape[1]):
    corr_matrix = pd.DataFrame(
        node_correlation(data_imputed, i),
        columns=valid_features,
        index=valid_features
    )
    sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1, ax=axes[i])
    axes[i].set_title(f'{valid_station[i]} Correlation')

plt.tight_layout()
plt.show()

Code
fig, axes = plt.subplots(2, 4, figsize=(20, 10))  # 2 rows, 4 columns grid
axes = axes.flatten()

for i in range(data_imputed.shape[2]):
    corr_matrix = pd.DataFrame(
        between_node_correlation(data_imputed, i),
        columns=valid_station,
        index=valid_station
    )
    sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1, ax=axes[i])
    axes[i].set_title(f'Between-Node Correlation: {valid_features[i]}')

plt.tight_layout()
plt.show()

Code
from sklearn.preprocessing import robust_scale

T, N, F = data_imputed.shape
imputed_df = pl.from_numpy(data_imputed.reshape(T * N, F), schema=valid_features)\
    .drop('dwpf', 'feel')\
    .with_columns(pl.DataFrame([valid_station[i % len(valid_station)] for i in range(T*N)], schema={'station': pl.Categorical})\
          .join(pl.from_pandas(station_df),
                on='station',
                how='left'))\
    .select(pl.col(['station', 'lon', 'lat', 'elevation', 'tmpf', 'relh', 'sknt', 'drct_sin', 'drct_cos']))\
    .drop('lon', 'lat', 'elevation')\
    .with_columns(pl.col('relh')/100)

imputed_df = imputed_df.drop('tmpf', 'sknt')\
    .hstack(pl.DataFrame(robust_scale(imputed_df.select(pl.col('tmpf'))), schema=['tmpf']))\
    .hstack(pl.DataFrame(robust_scale(imputed_df.select(pl.col('sknt'))), schema=['sknt']))\
    .select(pl.col(['station', 'tmpf', 'relh', 'sknt', 'drct_sin', 'drct_cos']))

imputed_df
shape: (30_667, 6)
station tmpf relh sknt drct_sin drct_cos
cat f64 f64 f64 f64 f64
"GCK" -1.355721 0.481483 -0.080357 0.851117 0.524977
"LBL" -1.265174 0.52015 0.160714 0.664796 0.747025
"EHA" -1.168491 0.633828 -0.285714 0.970763 0.24004
"HQG" -1.205638 0.484683 -0.223214 0.936332 0.351115
"3K3" -1.241791 0.531278 -0.383929 0.981255 0.192712
"EHA" -0.368491 0.349556 -0.991072 -0.533205 -0.845986
"HQG" -0.417247 0.322311 -1.205357 0.977334 -0.211704
"3K3" -0.432836 0.313778 -0.741072 -0.700217 -0.71393
"JHN" -0.417579 0.385544 -0.705357 -0.824675 -0.565607
"19S" -0.441128 0.365522 -0.491072 -0.824675 0.565607
Code
# if the station is going to be used for linear regression apply a one-hot encoding to it
imputed_pd_df = imputed_df.to_pandas()
imputed_pl_df = imputed_df

Modeling and Results

  • Explain your data preprocessing and cleaning steps.

  • Present your key findings in a clear and concise manner.

  • Use visuals to support your claims.

  • Tell a story about what the data reveals.

Conclusion

  • Summarize your key findings.

  • Discuss the implications of your results.

References